-
Notifications
You must be signed in to change notification settings - Fork 61
Add cosmo support to reconfigurator #9134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
164893c to
ee4ae51
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing my comment! I have a few questions.
Has this been tested on a real cosmo yet? Is it even possible at this point?
Do the SP, RoT, RoT Bootloader, Host OS take roughly the same time to reset on a Cosmo as they do on a Gimlet? We have several timeouts scattered throughout the reconfigurator code based on reset times on a gimlet.
Here we give each component a very generous amount to finish its post update actions. In the simplest of cases like the SP, we just reset the device, but others like the RoT Bootloader and the Host OS have several steps.
omicron/nexus/mgs-updates/src/driver_update.rs
Lines 645 to 682 in 81d8226
| // Timeouts, timeouts: always wrong! | |
| // | |
| // We have to pick some maximum time we're willing to wait for the `post_update` | |
| // hook to complete. In general this hook is responsible for resetting the | |
| // updated target to cause it to boot into its new version, but the details vary | |
| // wildly by device type (e.g., post-update RoT bootloader actions require | |
| // multiple RoT resets) and the expected amount of time also varies wildly | |
| // (e.g., resetting a gimlet SP takes a few seconds, resetting a sidecar SP can | |
| // take 10s of seconds, resetting a sled after a host OS update takes minutes). | |
| fn post_update_timeout(update: &PendingMgsUpdate) -> Duration { | |
| match &update.details { | |
| PendingMgsUpdateDetails::Sp { .. } => { | |
| // We're resetting an SP; use a generous timeout for sleds and power | |
| // shelf controllers (which should take a few seconds) and an even | |
| // more generous timeout for switches (which we've seen take 10-20 | |
| // seconds in practice). | |
| match update.sp_type { | |
| SpType::Sled | SpType::Power => Duration::from_secs(60), | |
| SpType::Switch => Duration::from_secs(120), | |
| } | |
| } | |
| PendingMgsUpdateDetails::Rot { .. } => { | |
| // Resetting the RoT should be quick (a few seconds), but we wait | |
| // for boot info after the reset. | |
| Duration::from_secs(90) | |
| } | |
| PendingMgsUpdateDetails::RotBootloader { .. } => { | |
| // Resetting the bootloader requires multiple RoT resets; give this | |
| // a longer timeout. | |
| Duration::from_secs(210) | |
| } | |
| PendingMgsUpdateDetails::HostPhase1(..) => { | |
| // Resetting a sled takes several minutes (mostly DRAM training); | |
| // give something very generous here to wait for it to come back. | |
| Duration::from_secs(30 * 60) | |
| } | |
| } | |
| } |
Here we poll the device every 10 seconds to see if it has been reset.
omicron/nexus/mgs-updates/src/driver_update.rs
Lines 415 to 443 in 81d8226
| if try_reset { | |
| // We retry this until the component update has been successfully | |
| // updated, or we get some error *other* than a communication error or | |
| // some other transient error. There is intentionally no timeout here. | |
| // If we've staged an update but not managed to reset the device, | |
| // there's no point where we'd want to stop trying to do so. | |
| while let Err(error) = | |
| update_helper.post_update(log, &mut mgs_clients, update).await | |
| { | |
| if error.is_fatal() { | |
| let error = InlineErrorChain::new(&error); | |
| error!(log, "post_update failed"; &error); | |
| return Err(ApplyUpdateError::SpComponentResetFailed( | |
| error.to_string(), | |
| )); | |
| } | |
| // We only care whether the update has completed. We ignore all | |
| // pre-check errors because they could all be transient if a reset | |
| // is in the process of happening. | |
| if let Ok(PrecheckStatus::UpdateComplete) = | |
| update_helper.precheck(log, &mut mgs_clients, update).await | |
| { | |
| break; | |
| } | |
| tokio::time::sleep(RESET_DELAY_INTERVAL).await; | |
| } | |
| } |
For the RoT and RoT bootloader we wait a specific amount of time for boot info to show up
omicron/nexus/mgs-updates/src/rot_updater.rs
Lines 30 to 32 in 81d8226
| pub const WAIT_FOR_BOOT_INFO_TIMEOUT: Duration = Duration::from_secs(120); | |
| const WAIT_FOR_BOOT_INFO_INTERVAL: Duration = Duration::from_secs(10); |
omicron/nexus/mgs-updates/src/rot_updater.rs
Lines 268 to 344 in 81d8226
| /// Poll the RoT asking for its boot information. This confirms that the RoT has | |
| /// been succesfully reset | |
| pub async fn wait_for_boot_info( | |
| log: &Logger, | |
| mgs_clients: &mut MgsClients, | |
| sp_type: SpType, | |
| sp_slot: u16, | |
| timeout: Duration, | |
| ) -> Result<RotState, PostUpdateError> { | |
| let before = Instant::now(); | |
| loop { | |
| debug!(log, "waiting for boot info to confirm a successful reset"); | |
| match mgs_clients | |
| .try_all_serially(log, |mgs_client| async move { | |
| mgs_client | |
| .sp_rot_boot_info( | |
| &sp_type, | |
| sp_slot, | |
| SpComponent::ROT.const_as_str(), | |
| &GetRotBootInfoParams { | |
| version: RotBootInfo::HIGHEST_KNOWN_VERSION, | |
| }, | |
| ) | |
| .await | |
| }) | |
| .await | |
| { | |
| Ok(state) => match state.clone() { | |
| RotState::V2 { .. } | RotState::V3 { .. } => { | |
| debug!(log, "successfuly retrieved boot info"); | |
| return Ok(state.into_inner()); | |
| } | |
| // The RoT is probably still booting | |
| RotState::CommunicationFailed { message } => { | |
| if before.elapsed() >= timeout { | |
| error!( | |
| log, | |
| "failed to get RoT boot info"; | |
| "error" => %message | |
| ); | |
| return Err(PostUpdateError::FatalError { | |
| error: message, | |
| }); | |
| } | |
| info!( | |
| log, | |
| "failed getting RoT boot info (will retry)"; | |
| "error" => %message, | |
| ); | |
| tokio::time::sleep(WAIT_FOR_BOOT_INFO_INTERVAL).await; | |
| } | |
| }, | |
| // The RoT might still be booting | |
| Err(error) => { | |
| let e = InlineErrorChain::new(&error); | |
| if before.elapsed() >= timeout { | |
| error!( | |
| log, | |
| "failed to get RoT boot info"; | |
| &e, | |
| ); | |
| return Err(PostUpdateError::FatalError { | |
| error: e.to_string(), | |
| }); | |
| } | |
| info!( | |
| log, | |
| "failed getting RoT boot info (will retry)"; | |
| e, | |
| ); | |
| tokio::time::sleep(WAIT_FOR_BOOT_INFO_INTERVAL).await; | |
| } | |
| } | |
| } | |
| } |
omicron/nexus/mgs-updates/src/rot_bootloader_updater.rs
Lines 284 to 327 in 81d8226
| /// Poll the RoT asking for its boot information. This is used to check | |
| /// the state for RoT bootloader image errors after RoT is reset | |
| async fn wait_for_stage0_next_image_check( | |
| log: &Logger, | |
| mgs_clients: &mut MgsClients, | |
| sp_type: SpType, | |
| sp_slot: u16, | |
| timeout: Duration, | |
| ) -> Result<Option<RotImageError>, PostUpdateError> { | |
| debug!(log, "attempting to verify image validity"); | |
| match wait_for_boot_info(log, mgs_clients, sp_type, sp_slot, timeout).await | |
| { | |
| Ok(state) => match state { | |
| // The minimum we will ever return is v3. | |
| // Additionally, V2 does not report image errors, so we cannot | |
| // know with certainty if a signature check came back with errors | |
| RotState::V2 { .. } => { | |
| let error = "unexpected RoT version: 2".to_string(); | |
| error!( | |
| log, | |
| "failed to get RoT boot info"; | |
| "error" => &error | |
| ); | |
| return Err(PostUpdateError::FatalError { error }); | |
| } | |
| RotState::V3 { stage0next_error, .. } => { | |
| debug!(log, "successfully completed an image signature check"); | |
| return Ok(stage0next_error); | |
| } | |
| // This is unreachable because wait_for_boot_info loops for some | |
| // time if it encounters `CommunicationFailed`, and if it hits the | |
| // timeout, it will return an error. | |
| RotState::CommunicationFailed { message } => { | |
| error!( | |
| log, | |
| "failed to get RoT boot info"; | |
| "error" => %message | |
| ); | |
| return Err(PostUpdateError::FatalError { error: message }); | |
| } | |
| }, | |
| Err(error) => return Err(error), | |
| } | |
| } |
omicron/nexus/mgs-updates/src/rot_bootloader_updater.rs
Lines 269 to 277 in 81d8226
| // We wait for boot info to ensure a successful reset | |
| wait_for_boot_info( | |
| log, | |
| mgs_clients, | |
| update.sp_type, | |
| update.slot_id, | |
| WAIT_FOR_BOOT_INFO_TIMEOUT, | |
| ) | |
| .await?; |
Sorry for the late review btw! I was out for a couple of weeks
| // We assume a valid model ID for sleds | ||
| model: match sp_id.type_ { | ||
| SpType::Sled => GIMLET_SLED_MODEL.to_string(), | ||
| _ => format!("dummy_{}", sp_id.type_), | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not too complex, it would be really nice to have some tests using a cosmo (unit or reconfigurator-cli ones).
| "913-0000023" => Some(Self::Cosmo), | ||
| "913-0000019" | "913-0000006" => Some(Self::Gimlet), | ||
| COSMO_SLED_MODEL => Some(Self::Cosmo), | ||
| GIMLET_SLED_MODEL | "913-0000006" => Some(Self::Gimlet), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realise this was already here, but is 913-0000006 just a Gimlet ID that is used for test or legacy devices? Or is this used elsewhere? Do you mind adding a comment or so we know where this ID comes from, and where it is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm, I actually think that's incorrect now that I'm double checking it.
@dancrossnyc When you were working on #9016 where did you grab "913-0000006" from? At least according to pilot that's a sidecar serial https://github.com/oxidecomputer/pilot/blob/main/common/src/sp.rs#L135
Unfortunately we've been blocked on cosmo device ID. Without a device ID, we can't connect via sprockets which means we can't run RSS which means we can't run any actual services to run reconfigurator. This is also why I've led this PR sit here.
This is a slightly tricky question to answer. We know that we've seen Cosmo take more time than Gimlet for some things but we also haven't had a chance to finish profiling to see if this is expected or not. I think this will be something that will become more obvious as we actually start testing on actual hardware (thank you for the pointers on where to check!) |
NB: I have just completed signing on the two Cosmos in question. They're now pending reinstallation into london; see oxidecomputer/meta#776! |
No description provided.